Robust speech recognition with data bank accession organized by semantic attribute

ABSTRACT

A method for controlling an information system during the output of stored information segments via a signaling device (50 a). Useful information is stored in a database (32) for being requested, from which information at least one information segment is specified as a first data segment (W1) via a first voice signal (s_(a)(t), s_(a)(z)) and is provided via a control output (20, 40, 50; 50 a) or is converted (50 b) into a control signal for a technical device (G). The information is organized in the database such that an initially limited first information area (32 a) of stored information is accessible (4, 4 a, 4 b) to said voice signal, for selecting the specified information segment therefrom. A further information area (32 b, 32 c, 32 d) of said database (32) is activated (59, 70, 4 c, 4 d) as a second information area, if the information segment (W1) corresponding to a first voice signal segment (s1) of said first voice signal (s_(a)(t)) is not contained in said first information area (32 a). When accessing information of the database, a robust word recognition is obtained and the request is successfully processed within a short time.

RELATED APPLICATIONS

This application claims priority to German Patent Application No. 100 54413 4, filed on Nov. 3, 2000; German Patent Application No. 101 07 3364, filed on Feb. 16, 2001; and PCT International Application No. PCT/EP01/12632, filed on Oct. 31, 2001, entitled “Robust Voice Recognition With Data Bank Organization”, the contents of which are hereby incorporated by reference herein in their entirety.

The invention relates to a method for controlling a system supplying information, the input to said information system being a voice control. Certain information data (information segments) of a database are accessed and recalled (specified and selected) by voice, said database providing one of an initiating function of a device and a perceivable information on said information segment via an output path. Both results are designated as “output of information”.

In order to make information available by an audio inquiry, the inquiring person must be permitted to inquire by speaking as naturally as possible. At the same time, the database and voice recognition system behind an audio inquiry must be designed robustly to achieve a natural interface between man and machine, in order to provide correct information or to cause correct functions, and to react quickly, so that an inquiring person does not wait too long for information or for a requested function of the device. According to prior art, there are different approaches to increasing the robustness of a voice recognition system, e.g. an improved acoustic modeling of words, compare Schukat/Talamazzini in “Automatische Spracherkennung - Grundlagen, statistische Modelle und effiziente Algorithmen”, Vieweg, 1995, pages 199 to 230. Improved linguistic and semantic speech models are another possibility. However, it has recently been found that both approaches according to prior art only provide an insignificant reduction of the word error rate, which shall be improved.

In order to explain the complexity of a voice recognition, the essential criterion of optimizing speech models, namely perplexity, shall be referred to. Perplexity describes an average branching degree of a language or voice. If perplexity is low, the quantity of words which can possibly follow a recognized word decreases simultaneously. For dictation recognition systems, which can normally recognize and process a very large vocabulary of more than 60,000 words, a suitable speech model can considerably reduce the complexity of searching an optimum word chain. For an information inquiry, the basis is similar: a voice interface exists which records an inquiry and shall provide the related information from a database or carry out a corresponding action. A number of attributes, which shall be filled with different definitions by said voice recognition, generally defines an inquiry in acoustic form. The field content of a database required for a recognition possibly assumes a plurality of different values, e.g. all forenames or all surnames in case of a telephone information system, or the names of all cities in case of a traffic information system, so that a high degree of branching of the language or voice is present exactly at a point that is critical for an information retrieval, i.e. a very large number of hypotheses to be checked by a voice recognition (word recognition) is present. Due to said large number of hypotheses to be regarded, the probability of associating incorrect data (of the database) to an attribute increases on the one hand; on the other hand, the processing time for a decoding (recognition) of a verbal statement, i.e. a voice signal, increases.
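To make the notion of perplexity as a branching degree concrete, the following minimal sketch computes it for a uniform word distribution, in which case it equals the number of candidate words. The function name and the two vocabulary sizes are illustrative assumptions; a real speech model would use conditional word probabilities rather than a uniform distribution.

```python
import math

def perplexity(probabilities):
    """Return 2**H, where H is the Shannon entropy (in bits) of the distribution."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

# With a uniform distribution, perplexity equals the number of candidate words.
small_area = [1 / 400] * 400        # hypothetical limited vocabulary
large_area = [1 / 60000] * 60000    # hypothetical dictation-sized vocabulary

print(round(perplexity(small_area)))
print(round(perplexity(large_area)))
```

Under these assumptions, limiting the active vocabulary directly lowers the number of hypotheses the recognizer must weigh at each position.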

From WO-A 00/14727 (One Voice Technology), an interactive interface is known which recognizes speech and which is applied in connection with said interface between a verbal statement and a user interface of a computer. Two “grammar-files” are used, compare pages 4, 5, one special file of which is initially used as a first database, for subsequently accessing a second (more general) database with the complete spoken statement in order to retrieve a possible deficiency of recognized words or sense. An interactive inquiry is used for this purpose, said interactive inquiry in case of a negative recognition error in the first database offering a recognized sentence/sense (a “prompt”) to the user, which sentence/sense had a maximum confidence value in the first word recognition. Depending on a then required input of the user (Yes or No), either a global offer of a certain context is listed on the screen, compare page 17, last paragraph, or, if the inquiry is confirmed, a control of the computer is effected depending on the recognized sense. It is disadvantageous that a re-inquiry at the user (prompt) is regularly required and that the system is only able to recognize reliably, when the user has confirmed the content of the recognized sentence with a maximum confidence.

A technical problem of the invention is to provide a word recognition system for access to information of a database (by which also other data sources are understood), which recognition system is as robust as possible, and to process an inquiry successfully in a shortest possible time. This shall also be possible when the definitions of the parameters required for a specification of the database inquiry are of various types, which increases the probability of a recognition error of exactly said definitions.

The invention solves said problem. The basic idea of the invention is that a grouping/structuring of a database into smaller segments is of advantage. A word recognition can be processed more robustly, if, within an initially activated database segment (a limited first information area) the number of definitions is reduced. Thus, a more robust and more rapid system is obtained, which responds more rapidly to an inquiry or effects a desired action more rapidly.

A partition into database segments (designated as limited information areas in the claim) can for example be influenced by certain user or user group profiles which preferably have influence on the first information area which is initially activated.

By limiting possible branchings, the degree of branching is reduced and the alternatives for a more rapid evaluation by the word recognition are reduced.

Initially, a first information area of a database is activated, from which an information segment is requested via a first voice signal. Said information segment corresponds to a voice signal segment of a complete voice signal (or vice versa). For example, a relevant voice signal segment can be a surname “Meier” or “Fisher”, the related information segment being a certain telephone number stored in the database. Another alternative would be a voice signal segment “Nuernberg” or “New York”, for which train connections have to be retrieved, so that the related information segment can be departure times of trains, when the departure station is determined before.

The database is organized to initially make a limited area of stored information accessible for the voice recognition. If a specified information (corresponding to the voice signal segment) cannot be served by said area, i.e. if said information cannot be selected from said area, or if a confidence value, characterizing the reliability, is below a threshold value, a further information area is activated instead of said first information area. Said further information area can be larger, displaced or totally different.

Said second (further) information area can be expanded to a third information area, if no corresponding information segment in said second information area can be associated to said voice signal segment.
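The staged activation of information areas described above can be sketched as follows. This is only an illustration under stated assumptions: string similarity from Python's `difflib` stands in for the acoustic confidence value, the function names, the threshold of 0.9, and the sample entries with their placeholder numbers are hypothetical, and the areas are simply tried one after another.

```python
from difflib import SequenceMatcher

def best_match(segment, area):
    """Return (entry, confidence) for the closest entry in one information area."""
    scored = [
        (entry, SequenceMatcher(None, segment.lower(), entry.lower()).ratio())
        for entry in area
    ]
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))

def staged_lookup(segment, areas, threshold=0.9):
    """Try each information area in order; stop at the first confident match."""
    for index, area in enumerate(areas):
        entry, confidence = best_match(segment, area)
        if confidence >= threshold:
            return entry, index
    return None, None  # not found in any area, even the last one

# Hypothetical contents; the stored values are placeholders, not real numbers.
first_area = {"Meier": "number-A", "Fisher": "number-B"}
second_area = {"Mueller": "number-C", "Schmidt": "number-D"}
areas = [first_area, second_area]

entry, area_index = staged_lookup("Mueller", areas)
print(entry, area_index)
```

A segment found in the first area never touches the larger areas, which is where the gain in speed and robustness is expected to come from.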

If an association to a voice signal segment of the voice signal is not possible in a presently accessible (or activated) information area, at least an attribute is associated to said voice signal segment for evaluation, said attribute being determined from one out of several semantic categories; for said determination compare Gallwitz, Noeth and Niemann, “Recognition of Out-of-vocabulary Words and their Semantic Category”, Proceedings of the 2nd SQEL Workshop on Multi-lingual Information Retrieval Dialogs, Plzen, April 1997, pages 114 to 121. The semantic categories replace the concrete relation to an information segment in a database. They correspond to attributes which describe types of unknown words, particularly of words which were not recognized with a sufficient confidence.

A consequence of said evaluation of attributes is an expansion of the vocabulary of the database, thus a modification of the information area that is made accessible to the voice and word recognition. In this respect, at least a non-recognized signal segment is evaluated, which signal segment is already available and does not have to be recorded again. Said signal segment has a lower perplexity, particularly its temporal length is shorter, and can be evaluated more rapidly and more reliably by said voice recognition.

The expansion can depend on the attribute, particularly on the semantic category, which is associated to the unknown voice signal segment. Said dependency is at least a co-dependency, not an exclusive dependency. Despite an initially limited information area of the database, the semantic categorization permits a definition or correspondence of the database to be associated rapidly and robustly, said data being desired as a specific information segment. The number of allowed word chains (the degree of branching) is considerably reduced due to this kind of inquiry, and therefore also the plurality of hypotheses to be regarded, so that the probability of attributing a wrong definition is reduced and the processing speed can be increased.

Said new evaluation of an already available voice signal segment of said first voice signal can also be regarded as a “re-analysis”, for which the complete content of the preceding voice signal is not again brought in connection with the second selected information segment, but only an already available signal segment, the confidence value of which was not sufficiently high, “is connected” again with said second information area for a recognition. The dependency of the second information area on the attribute is a possibility of improving the recognition without requiring the user to act again, to respond to a “prompt” (system inquiry), or to be asked (“prompted”) for supplemental spoken information. A system operating in this manner is more intelligent and makes multiple use of available information, thus re-analyzes said information, and that not completely, but only at least one part thereof. Said part may also comprise two or more parts that were not recognized with a sufficient confidence, so that a selection of a further information area can depend on at least one attribute, but also on two attributes. Therefore, the selection of the second information area is at least co-determined by the attribute of the signal segment which has not been recognized with sufficient confidence (in a time interval).

When dividing said two recognition stages with a limited first information area and with a further information area into an analysis and a re-analysis (a first and a second analysis, and possibly subsequent analyses), it is evident that the voice system appears to the user to be more robust and more independent than with a necessary further inquiry as described above.

This does not exclude that in case of a failure of a search structure controlled in this manner, a dialog structure follows later, when the dependency on the attribute does not result in recognition with a sufficient confidence value in said further information area. Initially, however, said system is adapted to make a further recognition attempt using the described re-analysis, said re-analysis operating with a reduced signal segment from a signal that is already available.

Another possible consequence of an attribute evaluation is a signal output adapted to, influenced by, or depending on the semantic category, as a request to receive a new voice signal from the user. Said new voice signal usually has a lower perplexity, particularly a shorter temporal length, and can be evaluated more rapidly and more reliably by the voice recognition system.

A corresponding arrangement, comprising circuits or realized by software in connection with a program, effects the described method stages, so that it is evident which functional elements are used e.g. for the initially limited allowable information, and for a subsequent permission of access to an expanded information area, for retrieving an unknown voice signal segment replaced (specified) by a category.

An output can be made by an action, visually, or acoustically; said output is perceivable.

The control of the device can be realized in a telephone system, establishing a connection with a desired interlocutor who is registered in a database by a telephone entry, said entry being the information segment addressed by a voice signal.

The attributes (the semantic categories) characterizing, replacing, or being associated to a voice signal segment, which is not available in a presently activated information area of a database, can for example be surnames, forenames, companies, departments, or functions, each of said attributes being for example represented by several OOV words or models and being correspondingly considered, when modifying the available information area.

Said several OOV models (OOV is an abbreviation for Out Of Vocabulary, thus the absence of a corresponding word in a presently activated or available vocabulary of a database, which is available to the voice decoding as recognition vocabulary) can refer to different languages, thus one model for e.g. German and one for English. Said recognition supports the recognition of a semantic category of the word which was not recognized or recognized with insufficient confidence, said word corresponding to a voice signal segment of a voice signal.

An OOV model is an acoustic model in which different part models are connected in parallel. Acoustic models (or phonemic models) are used as part models, which models can be language-dependent. An OOV model comprises several part models which cooperate by means of transition probabilities, said probabilities providing information on the probability of transition from one part model to another part model, said part models being elements of said OOV model. The completeness of an OOV model is a result of the total of all part models and of an evaluation or predetermination of transition probabilities between said part models.
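As a rough illustration of part models linked by transition probabilities, the following sketch scores a sequence of units against a small transition table. The unit labels, the start and end markers, and all probability values are invented for the example; the description does not specify the internal structure of the OOV model at this level of detail.

```python
import math

# Hypothetical part models (phoneme-like units) and transition probabilities.
transitions = {
    ("<s>", "m"): 0.5, ("<s>", "f"): 0.5,
    ("m", "a"): 0.6, ("m", "i"): 0.4,
    ("f", "a"): 0.3, ("f", "i"): 0.7,
    ("a", "</s>"): 1.0, ("i", "</s>"): 1.0,
}

def path_log_probability(units, transitions):
    """Sum log transition probabilities along a path through the part models."""
    path = ["<s>", *units, "</s>"]
    total = 0.0
    for current, following in zip(path, path[1:]):
        p = transitions.get((current, following), 0.0)
        if p == 0.0:
            return float("-inf")  # transition not allowed by this model
        total += math.log(p)
    return total

print(path_log_probability(["m", "a"], transitions))
```

A separate table of this kind could be kept per language or per semantic category, in line with the several OOV models mentioned above.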

As an explanation, it is mentioned that a not-recognized status or a not-contained status in the activated database area is represented by a threshold value comparison. In this respect, the basic thesis is that an insufficient confidence value, which is below a threshold value, corresponds to a not-recognized or not-contained status. If the confidence value is above a threshold value, which can for example be between 80% and 95%, a signal segment is assumed to be correctly recognized, and the related database segment is output by an action, visually, or acoustically, so that the result is perceivable with respect to its general sense.

If a confidence value does not exceed the threshold value in said comparison, i.e. if the confidence is insufficient, a semantic category determined or made available during voice decoding has to be taken as a basis, instead of a recognized signal segment, said semantic category being a more general attribute of the voice signal segment. Said category can already be determined during voice recognition, but it can also be determined later, when the confidence value is too small for word recognition.
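The threshold comparison and the fallback to the semantic category can be sketched as a small decision function. The class, the function name, the returned labels, and the threshold of 0.9 are assumptions made for illustration; only the three recognition values (word, semantic category, confidence value) come from the description.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecognitionResult:
    word: Optional[str]        # W: recognized word, if any
    category: Optional[str]    # S: semantic category, e.g. "surname"
    confidence: float          # K: confidence value in [0, 1]

def evaluate(result, threshold=0.9):
    """Decide how to proceed based on confidence, then on the semantic category."""
    if result.word is not None and result.confidence > threshold:
        return ("output", result.word)
    if result.category is not None:
        return ("use_category", result.category)
    return ("standard_inquiry", None)

print(evaluate(RecognitionResult("Mueller", "surname", 0.97)))
print(evaluate(RecognitionResult("Mueller", "surname", 0.41)))
```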

Therefore, an evaluation of the semantic category is preferably provided for determining a “system response”. Said system response as an inner system adaptation either resides in a modification of the information made accessible by a database for a voice signal recognition, or in an outside perceivable system response by a perceivable output, which is dependent on the semantic category. A “correspondence to a first voice signal segment” is for example a semantic category.

When a segment or interval of the voice signal is the term “Fischer” or “New York”, the corresponding semantic category is “surname” or “city”. Insofar, the term “substantial interval or segment” is to be understood as the segment or interval of the voice signal which is to be recognized by the voice recognition to request the use(ful) information from the database. It is exactly because the confidence value is too small that a sufficiently reliable association to the database segment could not be obtained, and the subsequent stage according to the invention is an evaluation of a further attribute, said attributes comprising both the confidence value and the semantic category. A recognized word is not an attribute, because insofar, the first attribute (confidence value), by exceeding the threshold value, indicated that a sufficient confidence of having recognized the correct word is present.

Of course, a voice signal can be provided in an analog or in a digitized form. It does not have to be currently pronounced, but the first signal can be available as a voice file, and it can also be supplied over a larger distance via a transmission channel. Said signal is processed by a decoding means, which can be designated as voice recognition (recognizer, word recognizer), which voice recognition is known in prior art as a model and an algorithm, compare Schukat/Talamazzini, “Automatische Spracherkennung - Grundlagen, statistische Modelle und effiziente Algorithmen”, Vieweg 1995, as mentioned above.

A dialog control can help to simplify complex recognition problems. A first and a second voice signal are taken as a basis.

The first voice signal is the signal by which the desired information segment shall be specified. If said information segment cannot be made accessible due to a modification of the available information area, if particularly attributes for non-recognized voice segments alone do not help, a perceivable output is effected as a feedback.

A non-recognized segment of a voice signal causes an optical or acoustic output that depends on the attribute of the non-recognized voice signal segment. If a surname is concerned, a name can concretely be requested. If a city or a holiday destination is concerned, an adapted perceivable feedback to the user can take place. The concrete word itself can be included as a voice signal segment in said feedback.
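A category-dependent feedback of this kind can be sketched as a simple lookup. The surname prompt follows the wording used in the telephone example of the detailed description; the city prompt, the default prompt, and all names in the code are hypothetical.

```python
# Prompts keyed by semantic category; only the surname text is taken from the
# telephone example, the other texts are invented placeholders.
PROMPTS = {
    "surname": ("The name you mentioned is not available in the telephone "
                "registry; to which person do you want to talk?"),
    "city": "Which city do you mean?",
}
DEFAULT_PROMPT = "Please repeat your request."

def feedback_for(category):
    """Return the perceivable feedback adapted to the semantic category."""
    return PROMPTS.get(category, DEFAULT_PROMPT)

print(feedback_for("surname"))
```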

Then, a second voice signal is expected by the recognition procedure (second alternative), which voice signal considers said inquiry, the expected scope of perplexity (or the degree of branching) being very small, so that the complete use information of the database can be made available for the word recognition. However, this area can be limited to an area determined by the attribute (the category) of the non-recognized voice signal segment. Reasonably, also a much shorter signal length can be assumed, which corresponds practically completely to the substantial signal segment to be recognized.

Said process can be repeated until a suitable correspondence of the attribute is found in the stored vocabulary of the database. If this cannot be achieved, the process changes to the first stage (first alternative), outputs a standard inquiry, and repeats the at least single step-wise expansion of the information area.

The output of a system information for requesting a second voice signal to be provided, which request is oriented such that the second voice signal is a repetition of the first signal that cannot be evaluated, represents a fallback position.

The iterative determination (analysis, re-analysis) of the specific information segment can operate without a dialog request; if, in this case, no definite association of the not yet determined voice signal segment to an information segment is obtained after reaching the last stage, i.e. the complete information area of the database, a dialog information can also be output in this case, for requesting a second voice signal with a considerably reduced perplexity. A completely automatically operating variant is to start at the first information area again with an analysis and a (at least one) re-analysis, without a dialog structure.

The first information area can be adapted to a user (also to a group of users), by recording specific information segments of a user who is known in said database with regard to his behavior or his characteristic features (insofar with regard to at least one feature), such as preferred connections, usual interlocutors, speech behavior, dialog behavior and the like. Said profile-oriented control of the first data area (of the first activated information area) helps to limit the perplexity of the complete voice signal to the relevant information as signal segments, for which correspondences are to be searched in expanded information areas.

If the search is successful, a corresponding entry can be made in the profile of the currently connected user, to allow a direct access for the next use. The information frequently accessed by a specific user (also in the sense of a user group) can be made available quickly and robustly, and simultaneously, an access to rare information is provided.
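The profile-based first area and its update after a successful search can be sketched with plain sets. The standard words are taken from the examples given in the detailed description; the function names and the sample profile are assumptions for illustration only.

```python
# Standard words that are always part of the first information area.
STANDARD_WORDS = {"what", "I", "Mister"}

def first_area(profile):
    """First information area: standard words plus the user's profile entries."""
    return STANDARD_WORDS | profile

def record_success(profile, entry):
    """After a successful search in a further area, remember the entry."""
    profile.add(entry)

jack = {"Meier"}                   # hypothetical profile of one user
record_success(jack, "Mueller")    # "Mueller" was found in a further area
print(sorted(first_area(jack)))
```

On the next inquiry by the same user, the remembered entry is already part of the initially activated area.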

The invention is described in more detail and supplemented by embodiments.

FIG. 1 shows a diagram, which is a mixture of a block diagram and a functional sequence chart, in the sense of a flow diagram.

FIG. 2 is a voice signal s_(a)(t), which is illustrated as an analog signal and comprises a voice signal segment s1 during a time interval T1.

FIG. 3 is a self-explanatory diagram of the operating method of the modification of a memory area accessible to a decoder.

FIG. 3 a, 3 b, and 3 c illustrate exemplary database segments.

FIG. 4 is an illustration of an acoustic model for an OOV.

The embodiment of FIG. 1 illustrates a structure of a database system comprising an acoustic inquiry means via a channel 1 a. Said illustrated structure can be realized by software. In parallel, FIG. 1 simultaneously is a kind of flow diagram illustrating how the inner logic of the system operates.

A voice signal s_(a)(t) as an analog signal or as a digitized signal s_(a)(z) is schematically illustrated in FIG. 2. Said voice signal is recorded over a microphone 1 by a means 2; it can also be provided as a file or supplied via a transmission channel. A possible realization of FIG. 1 in a concrete embodiment is a telephone system, which provides information 50 a via a functional block 50, alternatively starts a function 50 b, or establishes a dialed connection. Said “action” shall characterize the control of the device, in the following reference being generally made to providing information or to an output of information contained in a database 32.

Said microphone 1 is specifically illustrated at an input, but not said transmission channel or said file comprising said voice signal, so that simply an input to a decoder 4 is taken as a basis, which decoder is supplied with a voice signal (of optional origin). Decoding corresponds to a conversion of voice into text using a word recognition means 4 (word recognition or voice recognition), additionally attributes being determined, such as a semantic category S1 of a signal segment s1 or a confidence value K1 for reliably recognizing a word W1.

A database 32 as illustrated comprises several areas 32 a, 32 b, 32 c, and 32 d, characterized by different stored information. Said information is made accessible to a decoder 4, which, based on its voice model, tries to associate (“allocate” or “assign”) word terms of said database to voice signals, or more precisely to segments of a voice signal.

If, based on a voice signal, said decoder 4 correctly recognizes an inquiry, said correct recognition is signalized via a control line 40, upon which information 50 a is provided or an action is initiated by a functional block 50 b. If a spoken statement (as a voice signal s_(a)(t)) is not completely recognized, functional paths 59, 59 a are used via 21 and 60, said functional paths being explained later.

The embodiment can serve as an exchange system, e.g. by telephone, which upon indication of a name establishes a connection with the indicated person. Said database 32 contains all names, companies, forenames, possibly departments or functions; additionally, standard words are contained which are applied in normal usage, for example terms like “what”, “I”, “Mister”, “Misses”, and different welcome phrases. Said standard words are comprised in a first information area 32 a. Additionally, user-dependent features or properties are considered here. A user “Jack” identifies himself, or is authenticated on the basis of a signal of a secondary information 9, and releases a stored profile which is one of several profiles accessible in a memory 10. Said profile 10 a, 10 b, . . . is supplemented to said information area 32 a, stored separately, or defines a certain portion of said information area 32 a, which portion is adapted to the speaker of the voice signal, or corresponds to said speaker.

An initial basis is the usual case that, despite a plurality of stored data, according to experience, a user phones 200 to 400 persons at maximum, so that said user-dependent 200 to 400 persons are accessible over said first area 32 a of information, when a profile 10 a, 10 b related to a certain user is selected.

According to an example, the operating process of a system response with a user inquiry is as follows. A system prompt is output via a loudspeaker 50 a or via a channel for supply into a telephone receiver, said system prompt being for example “How can I help you?” The user responds by a first voice signal s_(a)(t), for example “I want to talk to Mr. Mueller”. The system proceeds to a voice recognition using said first supplied voice signal, whereby it is supposed that the signal segment s1 during a time interval T1 according to FIG. 2 corresponds to the term “Mueller” (as a semantic category: surname).

If the decoder 4 retrieves specific information data with respect to the voice segment “Mueller” (for example his telephone number) in said first information area 32 a of said database 32, the system can select said data and respond to the inquiry by providing an information via a channel 20, 40, 50, for example “You are connected with Mr. Mueller”.

The above described system process is the most desired, quickly operating way, for said voice recognition 4 (recognizing module) to recognize the recorded signal, from a microphone 1 via a signal supply 1 a after recordation at 2, with such high confidence K1 that an allocation of the related stored data of the first information area 32 a allows an output of the information for a signal path 40 (in this embodiment of a word W1, particularly of a telephone number corresponding to the word “Mueller” in the database 32) to a control means 50 and a loudspeaker 50 a. As schematically illustrated in said process control, at first an initial selection of the database via a connection 4 a, 4 b is effected by a function 5 provided for this purpose, i.e. a first database area 32 a is made accessible to a recognition means 4 via a data line 4 b. Accessibility can also be provided by a transfer of the complete data block in a DMA procedure, so that said recognizing module 4 can have access to the data available in said database area 32 a.

According to the described embodiment of the telephone system, a correct recognition is to be supposed by recognizing the word “Mueller” with a supposed high confidence, i.e. with said confidence value K exceeding a minimum threshold value. In fact, said correct recognition causes two functions, namely an output of a system confirmation (“You are connected with Mr. Mueller”), and the performance of an action as an actuating reaction via 50 b, for transmitting an information to a technical device G, the telephone system of the present embodiment, which effectively establishes a corresponding connection.

The attributes S, K marked in a schematically illustrated block 18 of recognition values show the result of the recognition module 4, which additionally provides a word recognition. Said two attributes are the semantic category S and the confidence value K which are associated to a time interval T1, as shown in FIG. 2, and to a marked signal segment s1 of a voice signal s_(a)(t). A request 20 compares the attribute “confidence value” K1 with the threshold value. In case the threshold is exceeded, the recognized word W1 is output via a path 40, or related relevant information is provided over a loudspeaker 50 a, and/or a related action is provided for initializing a connection 50 b. Therefore, only the portion W is transmitted by the recognition values 18, after evaluation of the second portion K.

How the time interval T1, which is the substantial portion of the signal with regard to the desired request, is calculated or determined, is not described in more detail here. Said time interval can be determined based on the complete signal, based on acoustic models, based on key words, or also based on a semantic analysis of the complete voice signal, as shown in FIG. 2. Such calculations have been described as prior art hereinbefore, so that the control is based on the fact that the determined use information W, S and K is available at an output 19 of said recognition 4, said comparison 20 evaluating the confidence value K of said use information and, provided that said confidence value is sufficient, forwarding said word W to an information output 50, in the above described sense.

If an entry “Mueller” is not present in said information area 32 a of said supposed first user, if it is therefore not retrieved in a first data segment, the operating process of the system and the system response, respectively, are modified, supposing that no corresponding word entry W1 of said first area 32 a can be allocated to the signal segment “Mueller” during the time interval T1. Therefore, the decoder 4 causes the signal segment s1 to be characterized over a line 19, by allocating the still unknown word to an attribute (semantic category). Said word corresponds to a type, thus to an attribute, which in this case is designated “surname”. Via a loudspeaker 31 a or via a line of the telephone receiver, an output s_(31)(t) can be provided as a system response, for example “The name you mentioned is not available in the telephone registry; to which person do you want to talk?” The attribute characterization allocates the unknown segment s1 of the voice signal to a semantic category. Due to said semantic categorization, the system outputs a specific inquiry 31 a, modifying the system with regard to an expected second voice signal s_(b)(t), which is illustrated in FIG. 2. For a new iteration, the system adapts itself at 35 to the second voice signal having a reduced perplexity, thus a smaller degree of branching of the syntax and the words, namely, the system only expects a surname. Normally, this signal segment T_(b) is shorter in time and can be attributed substantially completely to a database entry. For a subsequent voice recognition with the same decoding means 4, a larger information area, at any rate an at least partly different information area of the database, can be connected via 4 c, 4 d or activated, thus the complete area 32 of all segments 32 b, 32 c, 32 d, said first segment 32 a, which is already present, remaining connected or being deactivated.

Due to the low perplexity, the term now supplied with the second voice signal s_(b)(t) can reliably be retrieved in the modified area or in the complete file of use information, and a recognition can be signalized via a channel 40,50,50 a.

The above-described operating method of the system as a modified system response (in the sense of an inner modification of the function) shall be described in more detail on the basis of the comparison with the confidence value in the request 20, which comparison is also described above. It is supposed that the confidence value is below the mentioned threshold value, which can be around 80%, 90%, or 95%, at any rate below 100%. Based on the thesis that an insufficient confidence value K1 for the “essential” portion s1 (essential in the sense of essential for the operating method and not required in the sense of a temporally long or large segment) corresponds to a non-recognition or a not-contained status of the searched word W1, a further attribute of the recognition result 18 is evaluated, namely the semantic category, which is designated S1 for the time interval T1 according to FIG. 2. Said semantic category is forwarded via a sequence path 21, corresponding to a signal path or a program expansion or branch, and transmitted to a control 60, inquiring whether a semantic category has been recognized by said recognition module 4. The semantic category was explained above. It corresponds to the type of the non-recognized word W1 in the segment T1. If such a category is present, the system can expand or branch in two paths, which are provided both alternatively and cumulatively. Said two expansions or branches are provided at 59 and 59 a.
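The decision described above, namely the comparison 20 followed by the request 60, can be sketched as follows. This is a minimal illustration only: the names `RecognitionResult` and `decide`, the returned labels, and the default threshold of 0.9 are assumptions made for the example and are not taken from the described system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognitionResult:
    """Hypothetical container for the use information W, S and K."""
    word: Optional[str]       # W: best-matching word, if any
    category: Optional[str]   # S: semantic category, e.g. "surname"
    confidence: float         # K: value between 0.0 and 1.0


def decide(result: RecognitionResult, threshold: float = 0.9) -> str:
    """Return the next step for one recognition result."""
    # Comparison 20: a sufficient confidence value forwards the word W.
    if result.word is not None and result.confidence >= threshold:
        return "output"
    # Request 60: a recognized semantic category opens paths 59 and 59 a.
    if result.category is not None:
        return "branch_59_or_59a"
    # No category available: only an unspecific re-inquiry remains.
    return "unspecific_reinquiry"
```

In this sketch an insufficient confidence value is treated the same way as a missing word, which mirrors the thesis that both correspond to a not-contained status of the searched word.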

Path 59: If a semantic category is present, the access to the database 32 is modified with respect to the recognition module 4 by a functionality 70, that can be provided as a program or a process control. Said modification of access is provided via marked signal paths 4 c and 4 d, reference being made to making a second database segment accessible, as well as transferring the complete data of another database segment, e.g. of a segment 32 b, to said recognition module 4. Said other database segment is explained symbolically in FIGS. 3, 3 a, 3 b, and 3 c. Said function 70 effects a database selection in accordance with the present semantic category. This corresponds to a modification of the data segment 32 a which has so far been available for said recognition. A completely different segment 32 b can be connected, for example a segment containing all surnames. The database can be expanded, as symbolized by 32 c. This corresponds to a larger database segment in a larger address segment. Finally, a partly overlapping combination 32 d with said first data segment 32 a of said memory 32 is possible, as shown in FIG. 3 b.

Said FIGS. 3 a, 3 b, and 3 c illustrate according to the theory of sets which information segment of the database is accessible via said control 70 subsequently to said initial data segment 32 a. A second database segment 32 b that is completely separate from said first database segment 32 a can be made available. A partly overlapping combination 32 d can also be provided, as well as an expanded database 32 c comprising said database segment 32 a and supplementing additional use information. The second set of use information of said database thus formed is made available to said recognition 4 according to FIG. 3, in order to allow a database segment, i.e. an entry in the database, to be associated to said recorded first signal (of function 2), that is still present, to the time interval T1 not determined in the first analysis, or to a new signal s₂(t). The evaluation provides a new set of recognition data 19 comprising a word W, a related confidence value K, and the attribute of the semantic category S, which attribute as such remains unchanged. In a request 20, the confidence value is again compared with the threshold value, which in most cases leads to a recognition result for W, resulting in a requested information via a path 40,50,50 a or 40,50,50 b (and initiation of a technical function).
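The three set relations between the initial segment and the subsequently accessible segment can be illustrated with ordinary sets. The entries and the helper `relation` are hypothetical and serve only to make the separate, overlapping and expanded cases concrete.

```python
# Hypothetical contents; only the set relations matter here.
area_32a = {"Meier", "Schmidt", "Weber"}                # initial, limited area
area_32b = {"Mueller", "Fischer"}                       # completely separate
area_32c = area_32a | {"Mueller", "Fischer", "Jones"}   # expanded, contains 32 a
area_32d = {"Weber", "Mueller"}                         # partly overlapping


def relation(first: set, second: set) -> str:
    """Classify how a second area relates to the first area."""
    if first.isdisjoint(second):
        return "separate"
    if first < second:          # first is a proper subset of second
        return "expanded"
    return "overlapping"
```

With these sample sets, `relation(area_32a, area_32b)` yields "separate", `relation(area_32a, area_32c)` yields "expanded", and `relation(area_32a, area_32d)` yields "overlapping".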

Path 59 a: An already mentioned second expansion or branching possibility of the request 60, whether a semantic category is present, is provided via a signal path or program path 59 a. When a semantic category is present, a voice output over a loudspeaker 31 can be provided, said voice output being adapted to the semantic category. A surname can specifically be asked for, a place can specifically be asked for, a department can specifically be asked for, depending on the number of semantic categories provided. Said voice output is controlled by a function 30. Said loudspeaker 31 a can be separate, but it can also be a loudspeaker 50 a, which is not separately illustrated in the functional sequence chart.
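A category-dependent voice output of this kind can be sketched as a simple lookup. Only the surname inquiry is worded after the description; the other prompt texts, the generic fallback and the name `inquiry_for` are assumptions chosen for illustration.

```python
# Hypothetical prompt texts keyed by semantic category.
PROMPTS = {
    "surname": "Which person do you want to talk to?",
    "place": "Which place do you mean?",
    "department": "Which department do you want to reach?",
}
GENERIC_PROMPT = "Please repeat your request."


def inquiry_for(category):
    """Return the inquiry adapted to the semantic category, if one is known."""
    return PROMPTS.get(category, GENERIC_PROMPT)
```

The number of entries in the table corresponds to the number of semantic categories provided, and a category without an entry falls back to an unspecific inquiry.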

Due to the initiated action, the system is adapted to adjust itself as a system response, such that a second signal input expected to be supplied by a user, which input corresponds to a second signal flow s_(b)(t), can be processed more easily, more rapidly and more selectively. Consequently, the system has already finished its adjustment and is no longer dependent on the reflections of the user. The adjustment or modification of the system according to the signal path 59 a, 30, 31 a in connection with an iteration 35, which leads to the functional blocks 2 and 5, is sufficient for a subsequent optimized recognition. In the easiest case, the expected signal s_(b)(t) corresponds again and only to the signal segment s1, if the surname which could not be recognized before is repeated, because the signal request 31 only asked for a person. Thus, a new signal is supplied to the recognition module 4 via a microphone 1, a signal transmission line 1 a, and a recording 2 of this shorter signal, so that now a new use information 18 comprising a word, a semantic category and a confidence value can be supplied to 19, the confidence value being substantially higher, preferably even so high as to exceed the threshold value, and to allow a signal output or function 50 a, 50 b corresponding to a word W1 to be initiated.

For such an operating process, the database can remain unmodified with respect to its accessible area. However, it can also be modified to an expansion or branching 59 according to a functional block 70 in the sense of the above description, within the conditions described in connection with FIG. 3. For the described process control, this function is effected by a database selection 5 which makes a new area of use information accessible to said module 4 via a control line 4 a,4 b.

In the following, the semantic category shall be described in more detail.

At 70, the semantic category determines which segment of the database 32 is associated for the decoding 4, for recognition of the voice signal s_(a) or s_(b). In case of a telephone system, e.g. the attributes name, forename, companies, department, functions are provided; in case of a train information system, the attributes type of train, places, times of a day (morning, afternoon, night) can be provided. Other attributes can be allocated (e.g. also languages), depending on the field of application.
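The selection of a database segment by semantic category can be sketched as a lookup keyed by attribute. The function name `select_area`, the sample entries, and the fallback to the complete database for an unknown category are assumptions made for this illustration.

```python
from typing import Dict, Optional, Set


def select_area(category: Optional[str],
                database: Dict[str, Set[str]]) -> Set[str]:
    """Return the segment for a category, else the complete database."""
    if category in database:
        return set(database[category])
    # Assumed fallback: make all segments accessible together.
    merged: Set[str] = set()
    for entries in database.values():
        merged |= entries
    return merged


# Hypothetical telephone-system database organized by attribute.
TELEPHONE_DB = {
    "name": {"Mueller", "Fischer", "Jones"},
    "forename": {"Anna", "Peter"},
    "department": {"Sales", "Support"},
}
```

For example, `select_area("name", TELEPHONE_DB)` makes only the surnames accessible to the decoding, whereas an unrecognized category yields the union of all segments.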

The unknown voice signal segment s1 or the new shorter segment s_(b)(t), which was recognizable with respect to its attribute or is known, but not with respect to its concrete content, is supplied by a decoder via a function 2, the database 32 being addressed correspondingly (e.g. via a control of the addressing area of the decoder 4).

When using the described information system, it is noticed that a user is only rarely interested in all information of the database 32. Therefore, a profile 10 a, 10 b is provided in a profile collection of a memory 10, said profile collection describing, delimiting, or supplementing a subset of the first segment 32 a of the database. Said profile can be either user-oriented or motivated by other structural properties of the user. Due to the predetermined different segments 32 a,32 b, . . . , a structure of the database 32 is obtained. The information request initially operates with limited information 32 a which, preferably, however, is user-oriented. Said limited information is not the end of the available information, but the beginning of localizing signal segments which are possibly not present (and which cannot be associated with words) in other areas of the database.

Initially, however, only said information area 32 a influenced by said profile is accessible for the word recognition in said decoder 4. In order to avoid confusions between recognized words, an acoustic measure is adopted for word recognition, said acoustic measure being designated as confidence value K. If the signal segment s1 is recognized with an insufficient confidence value (said confidence value being below a predetermined threshold value), the searched word is regarded as “not recognized”, and an attribute evaluation is effected with regard to the second attribute of “semantic category”, which is part of the recognition result 19.

Said semantic category is of a very rough nature and is in direct connection with the table entries of the database.

The change-over to another area, in order to allocate the voice segment not recognized in the first area, for retrieving the information segment requested by the speaker, also operates in multiple stages. Several further areas 32 b,32 c,32 d can be provided which are used consecutively.

According to another embodiment, the attribute S1 can predetermine one of said areas in which the search is continued. When the system according to FIG. 1 reaches the last area, without retrieving an entry W1 corresponding to the first voice signal segment s1 or the information relating to said voice signal segment via the acoustic decoder 4, the system returns and asks the user to supply the further voice signal. The new check of the newly spoken signal is initially carried out in said first information area 32 a, at least single-stage iteratively expanded by the above-described control 70 using the semantic category.
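The multi-stage change-over between areas can be sketched as a loop over an ordered list of areas. In this sketch the acoustic decoder is replaced by a plain string-similarity stand-in built on `difflib.SequenceMatcher`, which is an assumption made purely so that the example runs; the names `toy_recognize` and `search_areas`, the sample entries, and the threshold of 0.9 are likewise hypothetical.

```python
from difflib import SequenceMatcher


def toy_recognize(segment, vocabulary):
    """Stand-in for the decoder: best word and a similarity used as K."""
    best_word, best_score = None, 0.0
    for word in sorted(vocabulary):
        score = SequenceMatcher(None, segment.lower(), word.lower()).ratio()
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


def search_areas(segment, areas, recognize, threshold=0.9):
    """Try each area in turn; return (area name, word) or None."""
    for name, vocabulary in areas:
        word, confidence = recognize(segment, vocabulary)
        if word is not None and confidence >= threshold:
            return name, word
    # Last area reached without an entry: the caller asks the user again.
    return None


AREAS = [
    ("32a", {"Meier", "Schmidt"}),
    ("32b", {"Mueller", "Fischer"}),
]
```

A result of `None` corresponds to the case in which the last area is reached without retrieving an entry, so that the user is asked to supply a further voice signal.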

In FIG. 2, the voice output s₃₁(t) is symbolically illustrated in connection with the system response by a voice output 30,31 a depending on the semantic category. Said voice output corresponds to the time interval T1 and to the signal segment s1 in said time interval. The correspondence is to be understood such that a signal segment T₁* is inherent in the signal flow s₃₁(t), said signal segment T₁* carrying the correspondence to said signal segment s1 not recognized with sufficient confidence. Said “correspondence” is explained by an example. The unknown signal segment is the surname itself, e.g. “Fischer” or “Jones”. The signal response in case of an insufficient confidence and a voice output according to the recognized semantic category S1 is a request “Which person do you want to talk to?”, the term “person” corresponding to the signal flow in the segment T₁*. Therefore, the correspondence is a semantic category, which is output by voice, in accordance with a non-recognized name, which corresponds to a segment s1. Of course, said signal segments are not equal with respect to their temporal flow; they are only comparable with respect to their meaning, the semantic category abstractly defining the content of the signal segment s1.

After the signal output s₃₁(t), the system expects a second voice signal s_(b)(t). Said voice signal is of a considerably lower complexity. The complete information of the database with several, particularly all information areas 32 a,32 b,32 c,32 d can be made accessible to the decoder 4 for decoding. This kind of restrictive dialog strategy permits mentioning only one single database field, instead of simultaneously indicating different database fields (such as for example indicating the name, forename and company name). For a recognition of the content of said single database field by a voice recognition 4, not only the words of the information segment made accessible by the profile 10 a, 10 b in said first information area 32 a are used, but all values of the complete database 32, possibly also without said first area 32 a.

The determinations that the individual information areas 32 a,32 b,32 c,32 d are proper subsets of the complete information 32 result from the structure of the database 32. Normally it is useful to make said first area 32 a dependent on the user, as described before. Said dependency is achieved by a profile. Said profile describes the typical behavior of a certain user, when using the dialog system according to FIG. 1. The profile specifies (influences) a subset of the accessible database. When a group of users can be treated identically, said profile applies to several users actually having comparable properties.

The described semantic categories characterize the actually not recognized voice signal segment s1 as an attribute with regard to its generic meaning (its semantic category), without said voice recognition recognizing the actually spoken word in said decoder 4. Prior to such an association, a meaning (semantic) of a prior known finite quantity of meanings is attributed to each word in said database 32, so that only a finite quantity of meanings can be attributed. Each of said meanings is designated as semantic category.

Within said recognition module 4, different OOV models can be applied. Different semantic categories also have other OOV models. Such an OOV model is an acoustic model in which several (different) part models are connected in parallel, as shown in FIG. 4. Frequently, all voice models are used, a part model generally being designated as P. According to FIG. 4, a multitude of voice models is provided. They can be dependent on the language, in the present embodiment, German voice models (according to German language) being used as a basis. Some such voice models are /a/, /a:/, /ai/, /au/, /ax/, /b/, /d/, /eh/, /eh:/, /ey/, /f/, /g/, . . . /r/, /s/, /sh/, /t/, . . . /z/ and /zh/. A loop Z1 is provided to adapt the OOV model of FIG. 4 to cover a time interval s1 of variable length, said loop covering the occurrence of a sequence of part models P in said OOV model. At the loop transition Z1, between the left and the right node, a transition probability is defined which is designated W(p1|p2). Said transition probability is between the left and the right node of FIG. 4. It provides information on the probability of transition from one voice model p2 to a subsequent voice model p1, each of which, as a respective part model, is an element of the OOV model with all part models P. An OOV model is therefore completely defined by the quantity of all part models P and by an evaluation or predetermination of the transition probability W(p1|p2) for all p1,p2 in P.

The different OOV models are applied within the recognition module 4. Different semantic categories also have different OOV models, which is to be illustrated based on an embodiment. For said embodiment, the semantic categories “Strasse”¹ and “Stadt”² in German language are provided and shall be compared. A major part of all streets or roads (names of the streets or roads) have a suffix “Strasse”, as for example the Goethestrasse. Therefore, transition probabilities between the elements of said suffix are given, which are as follows:

-   W₁(p1=“t”|p2=“s”)
-   W₂(p1=“r”|p2=“t”)
-   W₃(p1=“a”|p2=“r”)
-   W₄(p1=“s”|p2=“a”)
-   W₅(p1=“s”|p2=“s”)
-   W₆(p1=“e”|p2=“s”)

¹ Remark of translator: German term “Strasse” = street, road
² Remark of translator: German term “Stadt” = city, town

When evaluating the suffix “Strasse”, said probabilities are substantially higher in the OOV model of the semantic category “Strasse”, than in the OOV model of the other semantic category “Stadt”. Therefore, a semantic category can be associated.
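The association of a semantic category via category-specific transition probabilities W(p1|p2) can be sketched as follows. All numerical probabilities, the floor value for unlisted transitions, the simplified letter-like part models, and the names `log_score` and `classify` are assumptions chosen for illustration; they are not values of the described OOV models.

```python
import math

# Keys are (p1, p2): probability of moving from part model p2 to p1.
# All numbers are invented for illustration.
OOV_MODELS = {
    "Strasse": {
        ("t", "s"): 0.4, ("r", "t"): 0.7, ("a", "r"): 0.6,
        ("s", "a"): 0.6, ("s", "s"): 0.3, ("e", "s"): 0.2,
    },
    "Stadt": {
        ("t", "s"): 0.5, ("a", "t"): 0.5, ("d", "a"): 0.5, ("t", "d"): 0.5,
    },
}


def log_score(part_models, transitions, floor=1e-3):
    """Sum of log W(p1|p2) over consecutive part models."""
    total = 0.0
    for prev, nxt in zip(part_models, part_models[1:]):
        total += math.log(transitions.get((nxt, prev), floor))
    return total


def classify(part_models, models=OOV_MODELS):
    """Return the category whose OOV model scores the sequence highest."""
    return max(models, key=lambda name: log_score(part_models, models[name]))
```

With these sample values, the sequence s, t, r, a, s, s, e scores higher in the “Strasse” model than in the “Stadt” model, so the category “Strasse” is associated without the word itself being recognized.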

In addition, a differentiation of the part models P themselves is useful, if a differentiation of semantic categories of English names and German names (as two different semantic categories) is to be used. Instead of the described voice models P according to FIG. 4, English voice models are used, which are not separately illustrated here.

With the OOV models adapted to or predetermined for each semantic category, word recognition of database segments not contained in said database 32 is more robust with respect to a time interval T1, the modeling being more adequate for said semantic category.

If, according to the embodiment of FIG. 1, at said path 59, a voice signal segment s1 of a first voice signal s_(a)(t) is present, which is not contained in said first information area 32 a of said database 32, a further information area of said database is activated as a second information area. The not-contained status corresponds to a recognition result 18 provided by said recognition module 4, the semantic category and the confidence value being determined, said confidence value, however, not exceeding the threshold value. At least an acoustic model according to FIG. 4 is predetermined, particularly, however, several acoustic models being provided, each of which forms an individual semantic category or is attributed to said category. As described, the different semantic categories can also be determined by different languages, when selecting voice models specific for a language.

The method according to FIG. 1 can also be operated such that a non-attributable voice signal segment s1 in said first information area 32 a of said database 32 is followed by an unspecific voice output as a re-inquiry, which is made via a channel 80,31. Thereupon, the system does not modify. A further analysis with a signal similar to said signal s_(a)(t) shall provide a reliable recognition of the now expected second voice signal to be obtained. Thus, the system is adapted to adjust itself without interference of human intellectual action, only based on predetermined system processes to obtain a subsequent reliable recognition that is more robust and more rapid. An intellectual contribution of the user is not decisive, his reaction with respect to its perplexity being in any case substantially reduced due to the output of the re-inquiry signal at 31 or 31 a, so that only for this reason, an improvement of the recognizability is obtained. The loudspeakers 31,31 a can also be identical.

As described above, the complete area of the database can be accessible for the substantially reduced perplexity of the second voice signal; however, a process can also be initiated, according to which said second voice signal is again checked by said voice model 4 and said first data area 32 a of information with respect to retrieving an entry.

The described variants can also be combined, such as the iterative determination with return to said first information area 32 a, with or without dialog-inquiry 31 or 31 a.

1. A method for controlling an information system during an output of stored information segments via a signaling device, the method comprising the steps of storing information in a database for being requested; providing a first voice signal having a first voice signal segment and providing a voice recognition module for decoding the first voice signal; organizing the information in said database such that a limited first information area of stored information is initially accessible to said first voice signal segment; specifying at least one information segment from the stored information as a first data segment by a decoded first voice signal segment and provide the specified information segment as output; activating a further information area of said database as a second information area from the stored information, when said first data segment corresponding to the first voice signal segment of said first voice signal is not contained in said limited first information area, wherein the activation of said second information area of the database is at least co-dependent on a semantic category of said voice signal segment; said semantic category being determined by the voice recognition module when decoding the voice signal segment of the first voice signal, to associate the semantic category to the voice signal segment.
2. The method of claim 1, wherein the output of stored information is communicated as a control output.
3. The method of claim 1, wherein said output of information is controlling a technical device.
4. The method of claim 1, wherein after activating said further information area, a third information area is activated if said specified information segment is also not available from said further activated information area.
5. The method of claim 1, wherein said first information area or said second information area is a real subset of the information stored in said database.
6. The method of claim 1, wherein said first limited information area of stored information of said database is at least partly determined according to a user profile.
7. The method of claim 1, wherein there is provided a threshold value or a confidence value calculated from said voice signal by a decoder, to allow a comparison.
8. The method of claim 1, wherein a status of not-containing said first data segment results from a recognition of acoustic models, each of said acoustic models being associated to an individual semantic category.
9. The method of claim 1, wherein said voice signal is decoded in a decoder using acoustic models, for providing a confidence value in said semantic category for said first voice signal segment of the first voice signal.
10. The method of claim 1, wherein a threshold value is provided for testing a confidence value determined by a voice recognition as an acoustic decoder, wherein said confidence value is compared to said threshold value.
11. The method of claim 10, wherein said comparison with said threshold value is effected, and depending on the comparison, the semantic category of a certain word being related to said first voice signal segment is evaluated.
12. The method of claim 1, wherein an information segment specified by said first voice signal segment is output via a control output, when a voice segment characterizing said specified information segment is found as a word in one of said first limited information area, said second information area and another information area activated later.
13. The method of claim 12, wherein the activation of said second information area depends on the semantic category of said voice signal segment, which voice signal segment having no correspondent in said limited first information area.
14. The method of claim 1, wherein the activation of said further information area depends on the semantic category of said voice signal segment, which voice signal segment had no corresponding word in said limited first information area as the voice signal segment could not be determined with a sufficient level of confidence.
15. The method of claim 1, wherein after activating said further information area, only a shortened voice signal segment is evaluated in a decoder, for specifying in said further information area an information segment corresponding to said voice signal segment.
16. The method of claim 15, wherein the shortened voice signal segment is the voice signal segment of said first voice signal, evaluated in the decoder, for specifying the corresponding information segment.
17. The method of claim 1, wherein after detecting a confidence value and a certain semantic category of said voice signal segment, a perceivable signal is output, said perceivable signal being dependent on the semantic category, if the confidence value is too small.
18. The method of claim 1, wherein said activated further information area is different from said first limited information area by one of larger, offset, and fully different.
19. The method of claim 1, wherein after evaluating the semantic category by said voice recognition module, a further voice signal segment is supplied for being evaluated having less perplexity.
20. The method of claim 19, wherein further voice signal segment supplied for being evaluated, has temporal length thereof being shortened with respect to said first voice signal evaluated by said voice recognition module.
21. The method of claim 1, wherein the activation of said second information area depends on the semantic category of said voice signal segment, which voice signal segment could not be determined, recognized, specified or selected with sufficient confidence.
22. A method for controlling an information system during an output of stored information segments via a signaling device, the method comprising the steps of storing information in a database suitable for being requested; providing a voice recognition system; specifying an information segment as a first data segment via a first voice signal provided to said voice recognition system from a user; providing the first data segment as a control output; organizing the information in said database such that initially only a first information area of stored information is accessible to said first voice signal, for selecting said specified information segment from said first information area; associating a semantic category as one of a group of properties to a voice signal segment within said voice signal with no dialogue requests to said user, when an information segment corresponding to said voice signal segment is not contained in said first information area; and activating a second information area, wherein said second activated information area is different from said first accessible information area and dependent on the semantic category, as associated to said voice signal segment.
23. The method of claim 22, wherein after evaluating the semantic category by said voice recognition system, a further voice signal segment is supplied for being evaluated, a temporal length of the further segment being shortened with respect to said first voice signal evaluated by said voice recognition system.
24. A method for controlling a technical device via a voice initiated action, comprising the steps of storing information in a database suitable for being requested; providing a voice recognition system; specifying an information segment as a first data segment via a first voice signal provided to said voice recognition system from a user; providing the first data segment as a converted control signal for said technical device; organizing the information in said database such that initially only a first information area of stored information is accessible to said first voice signal, for selecting said specified information segment from said first information area; associating a semantic category as one of a group of properties to a voice signal segment within said voice signal with no dialogue requests to said user that provided the first voice signal, when an information segment corresponding to said voice signal segment of said first voice signal is not contained in said first information area; and activating a second information area, wherein said second activated information area is different from said first information area, and dependent on the semantic category, as associated to said voice signal segment.
25. A method for controlling a technical device via a voice initiated action, the method comprising the steps of storing information in a database for being requested; providing a first voice signal having a first voice signal segment and providing a voice recognition module for decoding the voice signal; organizing the information in said database such that a limited first information area of stored information is initially accessible to said first voice signal segment; specifying at least one information segment from the stored information as a first data segment by a decoded first voice signal segment and provide the specified information segment as output; activating a further information area of said database as a second information area from the stored information, when said first data segment corresponding to the first voice signal segment of said first voice signal is not contained in said limited first information area, wherein the activation of the second information area of the database is at least co-dependent on a semantic category of said voice signal segment; said semantic category being determined by the voice recognition module when decoding the voice signal segment of the first voice signal, to associate the semantic category to the voice signal segment.
26. The method of claim 25, wherein the output is converted into a control signal for said technical device.